Exploratory data analysis - red wine

Xiaodont Tan

1 Univariate Plots Section

In this section, I examined the structure of the dataset, as well as all the variables in the dataset, including the quality and the attributes of the red wine.

1.1 Data summary

There are 13 variables in the dataset: an index variable X, 11 chemical attributes of the wine, and the quality of the wine.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

1.2 Quality

Wine quality ranges from 3 to 8. Most wines are of quality 5 or 6.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

I grouped wine quality into low (3~5) and high (6~8). The two categories contain 744 and 855 wines respectively, so each holds roughly half of the dataset.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## 
##  Low High 
##  744  855
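
The grouping can be reproduced from the frequency table alone. The snippet below is a minimal sketch, not the original analysis code: it rebuilds the quality column from the reported counts and bins it with `cut()`. The vector names `quality` and `quality_rank` are illustrative (the test output later in the report refers to a column called `quality.rank`).

```r
# Rebuild the quality column from the frequency table reported above
# (10 wines of quality 3, 53 of quality 4, and so on).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Bin quality into two groups. cut() uses right-closed intervals by default,
# so the breaks give (2, 5] -> "Low" and (5, 8] -> "High".
quality_rank <- cut(quality, breaks = c(2, 5, 8), labels = c("Low", "High"))

group_counts <- table(quality_rank)
print(group_counts)

# The mean of the rebuilt column should agree with the summary above.
print(round(mean(quality), 3))
```

Choosing the break at 5 keeps the two groups close in size, which is the reason given for this split.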

1.3 Fixed acidity

The fixed acidity ranges from 4.6 to 15.9 and roughly follows a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

1.4 Citric acid

The citric acid ranges from 0 to 0.8, with a few outliers at around 1. The data is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

1.5 Residual sugar

The normal range of residual.sugar is 0.9 to 9.0. Again, there are a few outliers with values much larger than this range (from 13 to 16).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Because the distribution has a long tail, I log-transformed the data to get a better view of its distribution.
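
As a rough illustration of what the transformation does, the sketch below applies `log10()` to the first ten `residual.sugar` values shown in the data summary. The base of the logarithm is an assumption (the report does not state it), and `tail_stretch` is a hypothetical helper introduced only for this example.

```r
# First ten residual.sugar values, copied from the str() output in section 1.1.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Log-transform (base 10 is an assumption made for this sketch).
log_sugar <- log10(residual_sugar)

# Hypothetical helper: how far the largest value lies above the median,
# measured in units of the interquartile range.
tail_stretch <- function(x) (max(x) - median(x)) / IQR(x)

print(tail_stretch(residual_sugar))
print(tail_stretch(log_sugar))
```

On this small sample the largest value sits closer to the bulk of the data after the transformation, which is why a log scale makes a long-tailed histogram easier to read. With ggplot2, the same effect is usually obtained by adding `scale_x_log10()` to the plot rather than transforming the column.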

1.6 Chlorides

The normal range of chlorides is 0.012 to 0.3. Again, there are a few outliers with values much larger than this range (from 0.4 to 0.6).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Because the distribution has a long tail, I log-transformed the data to get a better view of its distribution.

1.7 Volatile acidity

The volatile acidity is slightly right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

1.8 Free sulfur dioxide

The free sulfur dioxide is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

1.9 Total sulfur dioxide

The total sulfur dioxide is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

1.10 Density

The density roughly follows a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

1.11 pH

The pH value roughly follows a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

1.12 Sulphates

The level of sulphates mostly ranges from 0.33 to 1.5, with some outliers above 1.5 (the maximum is 2.0).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

1.13 Alcohol

Alcohol is right-skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

2 Univariate Analysis

2.1 What is the structure of your dataset?

There are 1599 red wine observations in the dataset with 13 variables, including an index variable (named “X”), the “quality” variable, and 11 other variables describing the chemical attributes of red wine.

The quality of the wine is an integer. It is a discrete value.

All the chemical attributes are floating-point numbers. They are measured in different units and therefore lie in widely different ranges.

2.2 What is/are the main feature(s) of interest in your dataset?

The quality of the wine is the main feature of interest. From common sense, I would also expect alcohol to play an important role in the quality of the wine.

2.3 What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

All the other features of the wine are potentially linked to its quality. From the description of the variables, I would expect volatile acidity and citric acid to influence the quality.

2.4 Did you create any new variables from existing variables in the dataset?

I grouped the quality data into a high quality group and a low quality group, each containing around half of the dataset.

3 Bivariate Plots Section

In this section, I explored the relationships between different variables. In particular, I plotted a few relatively strong relationships between different wine attributes, as well as between attributes and wine quality.

3.1 Summary of the relationship between different variables

The correlation between different variables in the dataset is shown below.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00
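
A table like the one above is typically produced with `cor()` followed by `round()`. The sketch below is an assumption about how that could look, run on only the first ten observations and six of the columns (values copied from the data summary in section 1.1), so its coefficients will not match the full-data table. The name `wine_sample` is illustrative.

```r
# First ten observations of six columns, copied from the str() output.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations, rounded to two decimals as in the table above.
cor_matrix <- round(cor(wine_sample), 2)
print(cor_matrix)
```

On the full dataset, the index column X and any derived grouping column would need to be dropped first, since `cor()` expects numeric inputs and the index carries no chemical meaning.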

The strengths of the correlation relationships are shown in the chart below.

3.2 The relationship between the attributes

The correlation matrix suggests that fixed.acidity is strongly positively correlated with citric.acid and density (r = 0.67 for both) and strongly negatively correlated with pH (r = -0.68).

The higher the level of fixed acidity is, the higher the level of citric acid is.

The higher the level of fixed acidity is, the higher the density is.

The higher the fixed acidity is, the lower the pH level is.

The relationship between free.sulfur.dioxide and total.sulfur.dioxide is also strong (r = 0.67).

The negative correlation between density and alcohol is also relatively strong (r = -0.5). The higher the density is, the lower the alcohol level is.

3.3 The relationship between the attributes and the quality

The correlation matrix suggests that volatile.acidity, alcohol, citric.acid and sulphates are weakly to moderately correlated with quality (r = -0.39, 0.48, 0.23 and 0.25 respectively).

The higher the wine quality is, the higher the alcohol level is (there is an exception for wine with quality 5).

This difference between low quality wine and high quality wine is statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  wine$alcohol by wine$quality.rank
## W = 154810, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
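
The two outputs above have the form produced by `cor.test()` and `wilcox.test()`. The following is a hedged sketch of such calls, run on the first ten observations only (values copied from the data summary), so the statistics will differ from those reported. The exact arguments used in the original analysis are not shown in the report; `exact = FALSE` is an assumption made here so that the normal approximation with continuity correction is used despite ties.

```r
# First ten observations of alcohol and quality, copied from the str() output.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Two-level grouping: quality 3-5 -> "Low", quality 6-8 -> "High".
wine$quality.rank <- cut(wine$quality, breaks = c(2, 5, 8),
                         labels = c("Low", "High"))

# Pearson correlation between quality and alcohol.
pearson <- cor.test(wine$quality, wine$alcohol)

# Rank sum test comparing alcohol between the two quality groups.
rank_sum <- wilcox.test(wine$alcohol ~ wine$quality.rank, exact = FALSE)

print(pearson$estimate)
print(rank_sum$p.value)
```

The correlation test treats quality as a numeric score, while the rank sum test only compares the Low and High groups, so the two tests answer slightly different questions about the same relationship.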

The higher the quality is, the lower the level of volatile acidity is. The difference between low and high quality wine is statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  wine$volatile.acidity by wine$quality.rank
## W = 438910, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The higher the wine quality is, the higher the level of citric acid is. The difference between low and high quality wine is statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  wine$citric.acid by wine$quality.rank
## W = 259850, p-value = 2.555e-10
## alternative hypothesis: true location shift is not equal to 0

The higher the wine quality is, the higher the level of sulphates is. The difference between low and high quality wine is statistically significant.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$quality and wine$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  wine$sulphates by wine$quality.rank
## W = 195150, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

4 Bivariate Analysis

4.1 Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The more alcohol, citric.acid, sulphates and the less volatile acidity the wine contains, the higher its quality is.

4.2 Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I observed that fixed acidity is strongly correlated with a few other attributes. It is strongly positively correlated with citric.acid and density (r = 0.67 for both) and strongly negatively correlated with pH (r = -0.68).

There is also a strong positive correlation between free.sulfur.dioxide and total.sulfur.dioxide.

4.3 What was the strongest relationship you found?

The strongest correlation I found was the one between pH and fixed.acidity (r = -0.68).

5 Multivariate Plots Section

5.1 Fixed acidity vs Alcohol by Quality

The graph shows the relationship between alcohol and fixed acidity for different wine qualities. When the fixed acidity is not very high (4 ~ 10), the alcohol level of high quality wine is higher than that of low quality wine. This is in line with the previous observation that the alcohol level and wine quality are positively correlated. When the fixed acidity is very high (>13), however, the low quality wine has more alcohol than the high quality wine.

5.2 Residual.sugar vs density by quality

When the level of residual sugar is low, it is positively correlated with density, and the density of low quality wine is mostly higher than that of high quality wine. However, when the level of residual sugar is higher (>4), these patterns disappear. (Outliers were removed from the chart.)

5.3 Sulphates vs Density by Quality

When the level of sulphates is low, it is positively correlated with density, and the density of low quality wine is mostly higher than that of high quality wine. However, when the level of sulphates is higher, these patterns disappear. (Outliers were removed from the chart.)

5.4 Multiple regression model

Model 1

In model 1, all the attributes were used as predictors. The adjusted R-squared was only 0.3561.

## 
## Call:
## lm(formula = quality ~ fixed.acidity + citric.acid + residual.sugar + 
##     chlorides + volatile.acidity + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

Model 2

In model 2, only the attributes that were significant predictors in model 1 were used as predictors. However, the adjusted R-squared increased only marginally, to 0.3567, so the model still explains only about 36% of the variance in quality.

## 
## Call:
## lm(formula = quality ~ chlorides + volatile.acidity + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16
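
The small change in adjusted R-squared between the two models can be checked by hand from the reported summaries. The sketch below uses the standard adjustment formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with the multiple R-squared values and predictor counts printed above. The function name `adjusted_r2` is hypothetical, and because the inputs are rounded to four decimals the results agree with the R output only up to rounding.

```r
# Adjusted R-squared from multiple R-squared, sample size n,
# and number of predictors p (intercept not counted).
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n_obs <- 1599  # observations in the dataset

# Model 1: 11 predictors, multiple R-squared 0.3606.
adj_model1 <- adjusted_r2(0.3606, n_obs, p = 11)

# Model 2: 7 predictors, multiple R-squared 0.3595.
adj_model2 <- adjusted_r2(0.3595, n_obs, p = 7)

print(c(model1 = adj_model1, model2 = adj_model2))
```

Dropping the four non-significant predictors lowers the multiple R-squared slightly, but the smaller penalty for model size more than offsets this, which is why the adjusted value rises a little while the fit itself is essentially unchanged.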

6 Multivariate Analysis

6.1 Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

When the fixed acidity is not very high (4 ~ 10), the alcohol level of high quality wine is higher than that of low quality wine. This is in line with the previous observation that the alcohol level and wine quality are positively correlated. When the fixed acidity is very high (>13), however, the low quality wine has more alcohol than the high quality wine.

6.2 Were there any interesting or surprising interactions between features?

For both residual.sugar and sulphates, when the level is low, it is positively correlated with density, and the density of low quality wine is mostly higher than that of high quality wine. However, when the levels are higher, these patterns disappear.

6.3 OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created two multiple regression models of wine quality. In model 1, all the attributes were used as predictors, and the adjusted R-squared was only 0.3561. In model 2, only the attributes that were significant predictors in model 1 were used as predictors. However, the adjusted R-squared increased only marginally, to 0.3567, so the model still explains only about 36% of the variance in quality.

Between them, the two models took all the possible attributes into consideration. However, some attributes are correlated with each other, which might affect the quality of the model. The influence of some attributes might also not be linear.

7 Final Plots and Summary

Plot One

Description One

One major finding of the project is that the alcohol level is an important indicator of the wine’s quality. Wines of higher quality (quality 6, 7, 8) contain more alcohol than wines of lower quality (quality 3, 4, 5).

Plot Two

Description Two

The relationship between alcohol level and wine quality interacts with other attributes of the wine. For example, when the level of fixed acidity is below 10, high quality wine has a higher level of alcohol. When the level of fixed acidity is above 10, however, the pattern disappears.

The plot is also an example of how the relationship between different attributes might hold only in a certain range. For example, the level of fixed acidity is negatively related to the level of alcohol when the level of fixed acidity is below 8. When the level of fixed acidity is above 8, however, the pattern disappears.

Plot Three

Description Three

The plot below shows another example of interaction between different variables.

The low quality wine has lower level of sulphates, as there are more red dots at the left of the plot.

When the level of sulphates is low (below 1), low quality wine has higher density, as the red line is above the green line. At the same time, the sulphates level is roughly positively related to density.

When the level of sulphates is high, no apparent patterns were identified.

Reflections

The correlation coefficients show that volatile.acidity, alcohol, citric.acid and sulphates have the strongest correlations with wine quality. Higher quality wine has higher levels of alcohol, citric acid and sulphates and a lower level of volatile acidity.

In the multiple regression model, however, the significant predictors of wine quality include alcohol, volatile acidity, sulphates, chlorides, pH, free.sulfur.dioxide and total.sulfur.dioxide.

Plotting shows that some relationships between factors are in fact not linear, and some correlations hold only in a certain range. Some relationships also interact with other factors.

In this dataset, the majority of the wines are of quality 5 or 6, which could be categorized as “medium” quality if 3 and 4 were categorized as low quality and 7 and 8 as high quality. However, that would make the groups very unequal in size. As a result, I grouped the wine quality into low (3, 4, 5) and high (6, 7, 8) instead. This two-group analysis might not capture the features of medium quality wine.

In this project, although log transformations were applied to some variables in the univariate analysis, the bivariate and multivariate analyses were conducted only on the original data, not the transformed data. Further models could be built on the transformed data.

Because I was unfamiliar with ggplot, I struggled a lot with choosing the right geoms and parameters. Some of the plots could be fine-tuned to look nicer or be more explicit.

In addition, some background research on red wine might give more insight into which variables to focus on and how to interpret the findings.

Reference

http://www.jerrydallal.com/lhsp/logs.htm
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/legend.html
http://stackoverflow.com/questions/8460257/constraining-stat-smooth-to-a-particular-range
https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/grid.html
http://docs.ggplot2.org/current/labs.html
http://www.sthda.com/english/wiki/add-legends-to-plots-in-r-software-the-easiest-way
http://docs.ggplot2.org/current/geom_jitter.html
https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
http://docs.ggplot2.org/current/scale_continuous.html